Background

This markdown is a preliminary investigation of the data provided by the Office of the Commissioner for Public Sector Employment (OCPSE). The data contains details from every advertisement on iworkforsa.gov.au.

Note that these data have been classified as "Public" by the OCPSE.

Packages

library(readxl)
library(tidyverse)
library(here)
library(lubridate)
library(kableExtra)
library(naniar)
library(patchwork)
library(skimr)
library(DT)
library(ComplexUpset)
library(tidytext)
library(tidyr)

Import, arrange and format

path <- here("IWORK4SA Data 2017-2021, Classification-Public.xlsx")

advertiments <- path %>% 
  excel_sheets() %>%
  map_dfr(
    function(sheet) {
      temp <- read_excel(path, sheet = sheet, skip = 8)
      mutate(temp, Year = sheet)
    }
  )

Checks and balances

Check the top and bottom of the data

Data read in from xlsx files often has bits of extra information above and below the actual "data". To check that we haven't accidentally imported that we can have a look at the top and bottom of our data set.

datatable(advertiments)
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
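Since the full table is too big for client-side DataTables, a lighter-weight peek at the top and bottom rows does the same job. A self-contained sketch on a toy frame (swap in `advertiments` for the real check):

```r
library(dplyr)

# Toy stand-in for the imported data
toy <- tibble(row = 1:100)

slice_head(toy, n = 5)  # top of the data
slice_tail(toy, n = 5)  # bottom of the data
```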

It looks like we've imported some extra rows that we don't want: footer details from each sheet and some rows that are entirely NA. We can remove them.

advertiments_clean <- advertiments %>%
  filter(!str_detect(`Date Publish To Public Drill`, "Report created") | is.na(`Date Publish To Public Drill`)) %>%
  filter(!is.na(`Position ID Job Title`))

Just to check this again: we imported five sheets from the xlsx file, so advertiments_clean should have at least five fewer rows than advertiments (one footer row per sheet, plus any all-NA rows). And the difference in rows is: 10!

Change format

Let's put the variables in a format that we can treat appropriately.

advertiments_formatted <- advertiments_clean %>% 
  mutate(`Date Publish To Public Drill` = ymd(`Date Publish To Public Drill`)) %>%
  mutate(`I Workfor SA Min Salary` = as.numeric(`I Workfor SA Min Salary`)) %>%
  mutate(`I Workfor SA Max Salary` = as.numeric(`I Workfor SA Max Salary`))
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
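Those "NAs introduced by coercion" warnings are worth chasing: any value that was non-missing as text but becomes NA under as.numeric() is an entry that failed to parse. The pattern in miniature (the salary strings here are invented):

```r
salary_raw <- c("60000", "Not specified", "75430", NA)
salary_num <- suppressWarnings(as.numeric(salary_raw))

# Entries that were present but failed to parse
salary_raw[!is.na(salary_raw) & is.na(salary_num)]
# → "Not specified"
```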

What are we working with

It would be good to have a look at the data types of each variable to work out whether they're suitable for what we'll want to do in the analysis.

advertiments_formatted %>%
  skim()
Data summary
Name Piped data
Number of rows 35532
Number of columns 15
_______________________
Column type frequency:
character 12
Date 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Division 0 1.00 7 55 0 69 0
Position Status 0 1.00 4 32 0 9 0
I Workfor SA Classification Level 887 0.98 9 74 0 325 0
Position ID 0 1.00 6 6 0 35532 0
Position ID Job Title 0 1.00 8 163 0 35532 0
I Workfor SA Vacancy Type 448 0.99 16 31 0 2 0
I Workfor SA Part Time 0 1.00 2 3 0 2 0
I Workfor SA Part Time Hours 21410 0.40 1 14 0 383 0
I Workfor SA Location 40 1.00 11 44 0 320 0
I Workfor SA Employment Status 451 0.99 6 19 0 4 0
I Workfor SA Job Category 453 0.99 6 62 0 41 0
Year 0 1.00 4 4 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
Date Publish To Public Drill 2384 0.93 2017-11-04 2021-06-08 2019-09-24 1017

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
I Workfor SA Min Salary 826 0.98 75939.52 30251.57 0 60681 75430 92784 324141 ▆▇▁▁▁
I Workfor SA Max Salary 826 0.98 87480.98 38156.83 0 66368 88420 101685 413381 ▅▇▁▁▁

Missing data

advertiments_formatted %>%
  gg_miss_var(show_pct = TRUE, facet = Year) +
  labs(title = str_wrap("Percent of missing data for each variable in `advertiments_formatted`", width = 50))

It seems that all the variables except I Workfor SA Part Time Hours have minimal missing values, and none of this changes much over time.

gg_miss_upset(advertiments_formatted)

It's reassuring that I Workfor SA Min Salary and I Workfor SA Max Salary are only missing in combination with each other (and sometimes with I Workfor SA Part Time Hours as well).

The only note that appears particularly pertinent from these analyses is that I Workfor SA Part Time Hours should be treated with caution. We should also note that the variables Division, Position Status, Position ID, Position ID Job Title, I Workfor SA Part Time and Year are all complete, with no missing values.

Analyses

"Data" jobs

The motivation for obtaining these data, which Owen and Tim discussed, was to look at advertisements for "data" jobs and investigate questions such as:

  • Are they increasing as a proportion of the advertisements?
  • What departments are recruiting them?
  • What sort of work are they being anticipated to do?

Definition caveats

To answer this question we need a definition of "data" jobs.

We can do this by including words and word-parts that signify the sort of job we are interested in. It is worth stating up front that this is necessarily a decision process rather than a data process. That is, the jobs we end up classifying as "data" and "not data" will be so classified because of our subjective decision, not because of some objective data. In addition, this has a signal-detection characteristic: due to the application of a strict definition, there will be errors in both directions, with some "not data" jobs ending up classified as "data" jobs and some "data" jobs ending up classified as "not data" jobs.

Definition

Regardless, we can set the definition and accept these caveats. Importantly, because the new variable Defined as a data job is based on Position ID Job Title, there will be no missing values in this new variable. It's worth noting that this step can (and should) be revisited as we decide more or fewer terms are applicable.

It's worth noting that there is a variable called I Workfor SA Job Category that is fairly complete. Because it's not actually complete, it's not suitable as a basis for our definition. But it's useful as a source of inspiration!

Data jobs will be defined as such:

data_job_definitions <- c("data", "analy", "information", "intelligence", "statistic", "scient", "research")

Then we can append this definition to the data frame:

advertiments_defined <- advertiments_formatted %>% 
  mutate(`Defined as a data job` = ifelse(str_detect(str_to_lower(`Position ID Job Title`), 
                                                     paste(data_job_definitions, collapse = "|")
                                                     ), 
         "Data", "Not data"))
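The alternation pattern built by paste(data_job_definitions, collapse = "|") matches a title whenever any one of the terms appears as a substring, so e.g. "scient" catches both "Scientist" and "Scientific". A self-contained check of the pattern logic (the titles are invented):

```r
library(stringr)

data_job_definitions <- c("data", "analy", "information", "intelligence",
                          "statistic", "scient", "research")
pattern <- paste(data_job_definitions, collapse = "|")

titles <- c("Senior Data Analyst", "Registered Nurse", "Research Scientist")
str_detect(str_to_lower(titles), pattern)
# → TRUE FALSE TRUE
```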

As a check on our definitions, we should see how often each occur.

data_job_type <- advertiments_defined %>%
  select(`Position ID Job Title`) %>%
  bind_cols(
    # One indicator column per definition term, named `Is "term"`
    map_dfc(
      set_names(data_job_definitions, paste0('Is "', data_job_definitions, '"')),
      ~ ifelse(str_detect(str_to_lower(advertiments_defined$`Position ID Job Title`), .x), 1, 0)
    )
  )

data_job_type %>%
  select(-`Position ID Job Title`) %>%
  pivot_longer(cols = everything(), names_to = "Data job type", values_to = "Count") %>%
  group_by(`Data job type`) %>%
  summarise(Count = sum(Count)) %>%
  ggplot(aes(x = reorder(`Data job type`, -Count), y = Count, fill = `Data job type`)) + 
  geom_col() + 
  theme(legend.position = "none") + 
  labs(x = "Data job type", title = "Counts of data job types")

This seems pretty sensible. Let's also check how they combine within a single Position ID Job Title.

data_job_type %>%
  select(-`Position ID Job Title`) %>%
  filter_all(any_vars(. == 1)) %>%
  upset(colnames(data_job_type[2:8]))

So there were 5 "data scientists". Seems pretty sensible.

Answers

Are "data jobs" increasing as a proportion of the advertisements?

increasing_data <- advertiments_defined %>%
  group_by(`Defined as a data job`, Year) %>%
  count(name = "Number of advertisments") %>%
  group_by(Year) %>%
  mutate(`Proportion within each year`= `Number of advertisments`/sum(`Number of advertisments`))

count_plot <- increasing_data %>%
  ggplot(aes(x = Year, y = `Number of advertisments`, colour = `Defined as a data job`, group = `Defined as a data job`)) + 
  geom_line() + 
  scale_y_log10() + 
  labs(title = "Count of jobs on iworkforsa each year by definition")

prop_plot <- increasing_data %>%
  ggplot(aes(x = Year, y = `Proportion within each year`, colour = `Defined as a data job`, group = `Defined as a data job`)) + 
  geom_line() + 
  scale_y_log10() + 
  labs(title = "Proportion of jobs on iworkforsa each year by definition")

count_plot/prop_plot

increasing_data %>%
  filter(`Defined as a data job` == "Data") %>%
  kable() %>%
  kable_styling()
Defined as a data job Year Number of advertisments Proportion within each year
Data 2017 75 0.0372949
Data 2018 458 0.0501478
Data 2019 500 0.0557414
Data 2020 519 0.0506885
Data 2021 286 0.0552230

This certainly doesn't make it look as though there has been an explosion in the proportion of "data" jobs advertised over the last four years. There were 61 more "data" jobs in 2020 than in 2018. While this is roughly a 13.3 percent increase in the number of these jobs advertised, it is worth noting that it only takes them from 5.01 percent of all advertised jobs in 2018 to 5.07 percent in 2020. That is an increase of just 0.054 percentage points.
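As a cross-check on the arithmetic, using the counts and proportions from the table above:

```r
# Counts of "data" ads from the table above
n_2018 <- 458
n_2020 <- 519

# Relative increase in the count of "data" ads
(n_2020 - n_2018) / n_2018 * 100   # ≈ 13.3 percent

# Absolute change in their share of all ads
(0.0506885 - 0.0501478) * 100      # ≈ 0.054 percentage points
```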

What departments are recruiting them?

departments_data <- advertiments_defined %>%
  group_by(Division, `Defined as a data job`) %>%
  count(name = "Number of advertisments in division") %>%
  ungroup() %>%
  group_by(Division) %>%
  mutate("Division total" = sum(`Number of advertisments in division`)) %>%
  ungroup() %>%
  mutate("Percent" = (`Number of advertisments in division`/`Division total`)*100)
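The grouped percentage step above can be sanity-checked in miniature (the divisions and counts here are invented):

```r
library(dplyr)

toy <- tibble(
  Division = c("A", "A", "B", "B"),
  `Defined as a data job` = c("Data", "Not data", "Data", "Not data"),
  n = c(2, 8, 1, 99)
)

toy %>%
  group_by(Division) %>%
  mutate(Percent = n / sum(n) * 100) %>%  # percent within each division
  ungroup()
# Division A: 20 / 80; Division B: 1 / 99
```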

most_data_departments <- departments_data %>%
  filter(`Defined as a data job` == "Data") %>%
  slice_max(order_by = Percent, n = 12) %>%
  pull(Division)

departments_data  %>%
  filter(Division %in% most_data_departments) %>%
  mutate(Division = factor(Division, levels = most_data_departments, ordered = TRUE)) %>%
  ggplot(aes(x = Division, 
             y = Percent, 
             fill = `Defined as a data job`)) + 
  geom_col(position = "dodge") + 
  labs(title = "Divisions with the greatest proportion of 'data' job advertisements", 
       x = "") + 
  scale_x_discrete(limits=rev) +
  coord_flip()

A more comprehensive investigation of these data can be made in the table below:

datatable(departments_data)

So, the Department of Health and Wellbeing has the greatest number of "data" job advertisements: 905. But it also has one of the smallest proportions of "data" job advertisements: 4.08%.

It is worth noting that there are multiple explanations for the differences in the number of advertisements of "data" jobs between the Divisions:

  • Divisions that "data" people like working in may keep their staff for longer and so advertise less
  • Some divisions may have recruited more "data" people in an earlier time period
  • Some divisions may not be hiring as many "data" people as they think they are

What sort of work are they being anticipated to do?

Using the Position ID Job Title, we can see what other words appear alongside those used to define Defined as a data job.

Let's start by looking at the words occurring most frequently in those jobs we classified as Defined as a data job == "Data".

advertiments_unnest <- advertiments_defined %>%
  select(`Defined as a data job`, `Position ID Job Title`) %>%
  unnest_tokens(Word, `Position ID Job Title`)


advertiments_unnest %>%
  filter(`Defined as a data job` == "Data") %>%
  count(Word, name = "Number of occurrences") %>%
  slice_max(order_by = `Number of occurrences`, n = 12) %>%
  ggplot(aes(x = reorder(Word, `Number of occurrences`), y = `Number of occurrences`, fill = Word)) +
  geom_col() + 
  coord_flip()  + 
  theme(legend.position = "none") + 
  labs(x = "", title = "Most frequently used words in 'data' job advertisements")

It would be more instructive to see the words that most differentiate between "Data" and "Not data" jobs:

advertiments_log <- advertiments_unnest %>%
  count(`Defined as a data job`, Word) %>%
  pivot_wider(names_from = `Defined as a data job`, values_from = n, values_fill = 0) %>% 
  mutate(log_ratio = log2(Data / `Not data`)) %>%
  filter(is.finite(log_ratio))
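The log-ratio step in miniature, showing why non-finite values must be dropped: a word that never appears on one side produces log2 of 0 or of x/0 (the words and counts here are invented):

```r
library(dplyr)

counts <- tibble(
  Word       = c("analyst", "nurse", "registrar"),
  Data       = c(40, 2, 0),
  `Not data` = c(5, 200, 30)
)

counts %>%
  mutate(log_ratio = log2(Data / `Not data`)) %>%
  filter(is.finite(log_ratio))
# "registrar" (log2(0/30) = -Inf) is dropped; "analyst" scores log2(8) = 3
```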

data_words <- advertiments_log %>%
  slice_max(order_by = log_ratio, n = 10)

not_data_words <- advertiments_log %>%
  slice_min(order_by = log_ratio, n = 10)

data_words %>%
  bind_rows(not_data_words) %>%
  ggplot(aes(x = reorder(Word, log_ratio), y = log_ratio, fill = log_ratio > 0)) + 
  geom_col() + 
  coord_flip()  + 
  theme(legend.position = "none") + 
  labs(x = "", y = "Log ratio (positive is more 'Data')", 
       title = "Words that most discriminate between 'Data' and 'Not data' jobs")